This barchart represents the number of taxi trips made in Chicago per year. We see that the in 2014 the frequency of taxis was at its peak and is decreasing since.
Through the plot below we can study the monthly behaviour of taxi frequencies for each year. We notice that between 2013 and 2019 the taxi trips frequency follow a similar behaviour, this may suggest that seasonality may have an influence on the taxi trips frequency. For example, every March, May and October the taxi frequency increased and the fequency seems to decrease for the months of June, July and November. However, We notice that years 2020 and 2021 seem to have a different pattern than the rest of the months.
In order to see the monthly effect on the taxi trips frequency we can plot the frequency through a heatmap and see its changes over the years and months.
In this plot, the high frequency is visualized by a a dark colour and the low frequency by a bright colour. In this plot, we can compare the frequency by year and by month. We notice that there may be a monthly (seasonal) effect on the taxi frequency, for example we see that in March and October the pattern seems to be darker than for the rest of the months.
## [1] 106 2
To explore in more details the relationship between time and taxi frequency we decide to analyze the taxi trips frequency as time series. Without any estimation we notice that there seems to be a decreasing trend after 2014.
p <- ggplot(by_y_m, aes(x = ym, y = count)) +
geom_line() + geom_smooth()+ theme_bw() + theme(axis.text.x = element_text(angle = 45, vjust = 0.1, hjust=0.5)) + scale_y_continuous(labels =comma) + xlab("Date") + ylab("Frequency") + labs(title = "Frequency of taxis between 2013 and 2021")
ggplotly(p)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
To analyze further the change in behaviour of the taxi trips frequency in Chicago we may group our data by districts. In the original dataset we have the information about the community areas, but since the number of areas is very high (77 areas) and it is difficult to visualize frequencies of 77 categories in one plot, we suggest to group our data using another geographical feauture: districts. According to Chicago 77 (http://www.thechicago77.com/chicago-neighborhoods/), we can regroup the areas into 9 districts. We now get a new variable: district which is the district in which the taxi made a pick up.
By plotting the frequencies of taxi trips through time for each district in one plot, we can assess the differences in the decrease of the frequency of taxi trips in each district and compare them. We notice that the districts which originally had a very small number of taxi trips (for example, Central, Near North, and Near South Side) didn’t had such a strong decrease of frequency as the districts that had higher values between 2013 and 2015 (for example, Far North Side).
In this part we suggest to consider that there may be a relation for the taxi frequency through time and propose to perform a time series analysis to estimate the trend, the seasonal effects and see whether the observations are correlated through time and can be represented as a particular time series model.
We first estimate a linear model with a liner trend and dummy variables for months (where we omit the month of January) as covariates. The output is the taxi trips frequency.
## MODEL INFO:
## Observations: 106
## Dependent Variable: number_of_taxis
## Type: OLS linear regression
##
## MODEL FIT:
## F(12,93) = 18.7396, p = 0.0000
## R² = 0.7074
## Adj. R² = 0.6697
##
## Standard errors: OLS
## ------------------------------------------------------------------
## Est. S.E. t val. p
## ----------------- -------------- ------------- ---------- --------
## (Intercept) 2622312.0927 178247.0026 14.7117 0.0000
## time -23006.6277 1542.9894 -14.9104 0.0000
## monthFeb 55434.5166 228284.2278 0.2428 0.8087
## monthMar 274823.8110 228299.8710 1.2038 0.2317
## monthApr 179397.8832 228325.9407 0.7857 0.4340
## monthMay 326897.9554 228362.4332 1.4315 0.1556
## monthJun 362441.8054 228409.3436 1.5868 0.1160
## monthJul 266674.3220 228466.6654 1.1672 0.2461
## monthAug 319552.8386 228534.3909 1.3983 0.1654
## monthSep 286716.3553 228612.5108 1.2542 0.2129
## monthOct 401104.8719 228701.0143 1.7538 0.0828
## monthNov 277457.6776 235385.5510 1.1787 0.2415
## monthDec 266354.1804 235431.0621 1.1313 0.2608
## ------------------------------------------------------------------
The variable time (which represents the trend) is negative and significant at 1% level, meaning that overall, we observe a decreasing trend through time. During the sample period, the number of taxi trips in Chicago decreased by an average of 23006 trips per month. The dummy variables representing the months (the seasonal components) are not significant and therefore do not suggest a difference in frequency of taxi trips depending on month in this model.
We consider to fit a second model as it may be more appropriate to consider the trend as quadratic. From the previous graphics, we notice that before 2014 the number of taxi trips seemed to increase and after 2014 the number of taxi trips decreased.
## MODEL INFO:
## Observations: 106
## Dependent Variable: number_of_taxis
## Type: OLS linear regression
##
## MODEL FIT:
## F(13,92) = 55.4318, p = 0.0000
## R² = 0.8868
## Adj. R² = 0.8708
##
## Standard errors: OLS
## ------------------------------------------------------------------
## Est. S.E. t val. p
## ----------------- -------------- ------------- ---------- --------
## (Intercept) 1820730.0989 129757.7391 14.0318 0.0000
## time 22566.8437 3896.3921 5.7917 0.0000
## I(time^2) -425.9203 35.2803 -12.0725 0.0000
## monthFeb 52027.1543 142778.5245 0.3644 0.7164
## monthMar 268860.9269 142788.8837 1.8829 0.0629
## monthApr 171731.3179 142805.7465 1.2026 0.2322
## monthMay 318379.5495 142828.9013 2.2291 0.0282
## monthJun 353923.3995 142858.2406 2.4774 0.0151
## monthJul 259007.7567 142893.7606 1.8126 0.0732
## monthAug 313589.9545 142935.5611 2.1939 0.0308
## monthSep 283308.9929 142983.8456 1.9814 0.0505
## monthOct 401104.8719 143038.9208 2.8042 0.0062
## monthNov 182051.5317 147431.6610 1.2348 0.2200
## monthDec 170948.0345 147460.0845 1.1593 0.2493
## ------------------------------------------------------------------
In this second model, the trend and the trend squared are both significant at 1% level, as well as some of the dummy variables representing months: May, June, Augusr, September and October, meaning that there is a significant difference in taxi trips frequency in these month compared to the month of January. The R squared in the second model is higher than in the first model, meaning that the the model with the squared trend fits better the observed data.
To verify the assumptions of the linear model (the residuals are normally distribued with a mean equal to 0 and a constant variance), we perform a residual analysis. In the first plot representing the residuals versus fitted values, we notice that the average mean of the residuals is close to 0. When the fitted values are small and high, the residuals are spread, in the middle of the plot the residuals seem to be dense. The spread is not constant, and we may suspect a heteroskedasticity issue in our model. Now looking at the qq plot we see that the points fall alonf the line in the middle of the graph, but not at the extremeties. This suggests that we may have too many extreme values.
In this part we represent the autocorrelation function and the partial autocorrelation functions which provide information about the dependence structure the time series may have. From the plots we may have a hint that we are in presence of a autoregressive model of order 1.
##
## Call:
## arima(x = residuals, order = c(2, 0, 0), n.cond = 3)
##
## Coefficients:
## ar1 ar2 intercept
## 1.0770 -0.2058 -21593.73
## s.e. 0.0991 0.1026 90775.23
##
## sigma^2 estimated as 1.577e+10: log likelihood = -1395.78, aic = 2799.55